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^ ■ Abstract 

<N 

_ . Detection of rare variants by resequencing is important for the identification of indi- 

viduals carrying disease variants. Rapid sequencing by new technologies enables low-cost 
resequencing of target regions, although it is still prohibitive to test more than a few 
individuals. In order to improve cost trade-offs, it has recently been suggested to apply 
pooling designs which enable the detection of carriers of rare alleles in groups of individ- 
uals. However, this was shown to hold only for a relatively low number of individuals in 
a pool, and requires the design of pooling schemes for particular cases. 

We propose a novel pooling design, based on a compressed sensing approach, which 
is both general, simple and efficient. We model the experimental procedure and show 
via computer simulations that it enables the recovery of rare allele carriers out of larger 
groups than were possible before, especially in situations where high coverage is obtained 
O I f° r each individual. 



Our approach can also be combined with barcoding techniques to enhance perfor- 



mance and provide a feasible solution based on current resequencing costs. For example, 
when targeting a small enough genomic region (~100 base-pairs) and using only ~ 10 

Q\ \ sequencing lanes and ~10 distinct barcodes, one can recover the identity of 4 rare allele 

^p. ' carriers out of a population of over 4000 individuals. 

> : 
X 

^ . 1 Introduction 



Genome- Wide Association Studies (GWAS) [29( have been successfully used in recent 
years to detect associations between genotype and phenotype, and numerous new alleles 
have been found to be linked to various human traits bd. 46]. However, genotyping 
technologies are limited only to those variants that are predetermined and prioritized 
for typing, which results in a bias towards typing of common alleles. 
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Although many common alleles were lately found to have statistically significant 
associations with different human traits, they were thus far shown to explain only a small 
fraction of most traits' heritability content. This, together with other theoretical and 
empirical arguments, raise the possibility that in fact rare alleles may play a sig nificant 
role in the susceptibility of human individuals to many common diseases [a, |l3|, |34|, |4C ] . 
Discovering and genotyping of rare alleles may therefore be of great bio-medical interest, 
however such studies require genotyping of large human populations - a task considered 
infeasible until recently. 

This state of affairs may change dramatically as we are currently witnessing a rapid 
revolution in genome sequencing due to emerging new technologies. Sequencing through- 
put at a given cost is growing at an exponential rate, in similar to Moore's law for com- 



puter hardware 38j |. At present next-generation sequencing technologies 2g, [28|, |39|] 
utilize massively parallel reading of short genomic fragments to achieve several orders 
of magnitude higher throughput at the same cost as previous Sanger sequencing ma- 
chines [38|]. The availability of cheap, high-throughput rapid sequencing methods leads 



to a change in the way researchers approach various biological problems, as it enables 
addressing questions which were infeasible to be studied before. 

Next-generation sequencing opens the possibility to obtain the genomic sequences 
of multiple individuals along specific regions of interest. This approach, often called 
resequencing, is likely to provide an extensive amount of novel information on human 
genetic variation. In particular, the ability to resequence a large number of individuals 
will enable the study of rare alleles in human populations. Resequencing of large popula- 
tions can thus fill a gap in our knowledge by allowing us to discover and type these rare 
variants, often with frequencies well below 1%, at given predefined regions. Of particular 
interest are regions around loci that have previously been established for involvement in 
disease, as they can be resequenced across a large population to seek novel variations. 

Another important application of interest is resequencing a set of specific known 
Single-Nucleotide-Polymorphisms (SNPs) with low minor allele frequency which are 
known or suspected to be important for a certain trait. In this case we are interested 
in identifying individuals carrying these alleles out of a very large group of individuals. 
For example, this may assist in scanning of large populations for individuals carrying 
certain risk alleles for a potentially lethal disease. For the sake of clarity in this paper 
we focus on this application, although discovery and typing of unknown rare variants 
can also be applied following the same lines. 

Current next-generation sequencing technologies provide throughput on the order 
of millions of reads in a single 'run' or 'lane' [38f] , where a sequence read is typically 
a short consecutive DNA fragment of a few dozens to a few hundreds nucleotides. In 
addition, novel experimental procedures enable targeted selection of pre-defined genomic 
regions prior to sequencing [l|, |25|, |42j. These methods, also called 'hybrid capture' or 
'hybrid selection', enrich significantly the DNA or RNA within the regions of interest 
and minimize the number of reads 'wasted' on fragments residing outside these regions. 
Together, these high-throughput technologies have made the identification of carriers 
over a pre-defined region a feasible, yet still an expensive task. 
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A naive but costly option is to utilize one lane per individual. However, when con- 
sidering a population of hundreds or thousands of individuals, such an approach is pro- 
hibitively expensive. Moreover, since resequencing is typically performed on targeted 
regions rather than the whole genome, throughput requirements to sequence an individ- 
ual are much lower than the capacity of a single lane, thus the naive approach is also 
highly inefficient. 

In such cases, "pooled" sequence runs may offer a more feasible approach. In "pooled 
DNA" experiments, DNA from several individuals is mixed and sequenced together on a 
single sequencing lane. Pooled genotyping has been used to quantify previo usly identified 



variations and study allele frequency distributions in populations 431. l45l. |47||. Given a 
measurement for each allele, it is possible to estimate the average frequency of the allele 
in those individuals participating in the pool. However, traditional pooled sequencing 
was used only to infer the frequency of rare alleles in a population, and did not give 
means to recover the identity of rare allele carriers. In this work we focus on the latter 
task, of identifying rare allele carriers by sequencing pooled DNA. 

The field of group testing 18] aims to tackle this problem of identifying individuals 
carrying a certain trait out of a group, by designing an efficient set of test, i.e., pools. 
This field which dates back to the mid 20 'th century has applications in biology 18l ]. 
streaming algorithms 4l| and communications (See [24] for a comprehensive survey.) 
Recently, several works have tried to use resequencing-based group testing methods in 
order to identify rare allele carriers. 

Prabhu and Pe'er [44| offered to use overlapping pools, elegantly designed based 



on Error-Correcting-Codes, to enable the recovery of a single rare-allele carrier from 
multiple pools. Individuals are represented in multiple pools, where the composition 
of different pools is constructed in a way which provides a unique pooling 'signature' 
for each individual. This carefully designed scheme enables the recovery of a rare allele 
carrier by observing the presence of reads containing the rare allele in these 'signature' 
pools. Their design offers a significant saving in resources, as it enables the recovery of a 
single carrier out of TV individuals, by using only 0(log N) pools. It is, however, limited 
to the case of a single rare allele carrier within the group, and the problem of detecting 
multiple (albeit few) carriers remained unsolved. 

In another approach by Erlich et al. 20J] a clever barcoding scheme combined with 
pooling was used, in order to enable the identification of each sample's genotypes. When 
using barcoding, each sample is "marked" by a unique short sequence identifier, i.e., 
barcode, thus upon sequencing one can identify the origin of each read according to its 
barcode, even when multiple samples are mixed in a single lane. Ideally, one could assign 
a different barcode to each individual sample, and then mix many samples in each lane 
while keeping the identity of each read based on its barcode. However, barcoding is 
a costly and laborious procedure, and one wishes to minimize the number of barcodes 
used. It was therefore suggested in to barcode different pools of samples (rather than 
individual samples), thus allowing the barcode to identify the pool from which a certain 
read was obtained, but not the identity of the specific sample. Efficient algorithms based 
on the Chinese-Remainder-Theorem enable the accurate recovery of rare allele carriers, 



3 



where both the total number of pools and the number of individual samples participating 
in each pool were kept low - the identification of N individual genotypes was obtained 
by using 0(y/N) different pools with 0{s/N) individuals per pool. 

In this work we present a different approach to recovering the identity of individuals 
carrying rare variants, based on Compressed Sensing (CS). CS and group testing are 
intimately connected 23(] , yet this approach was not studied in the context of rare allele 
identification. Our work extends the idea of recovering the identity of rare-allele carriers 
using overlapping pools beyond the single carrier case analyzed in 0|, and deals with 
heterozygous or homozygous rare alleles. The CS pooling approach enables testing of a 
larger cohort of individuals, thus identifying carriers of rarer SNPs. The CS paradigm 
also adapts naturally and efficiently to the addition of barcodes. We treat barcodes as 
splitting a given lane into many different lanes of approximately equal sizes in terms of 
number of reads - thus barcoding effectively boosts our number of lanes. 

Compressed Sensing 0, [3] is a new emerging and very active field of research, with 
foundations in statistics and optimization. New developments, updates and research pa- 



pers in CS appear literally on a daily basis, in various websites (e.g. http : //dsp . rice . edu/ cs) 
and blogs (e.g. |http : //nuit- blanche . blogspot . com/). Applications of CS theory can 
be found in many distantly related fields such as magnetic resonance imaging 37(, single 
pixel cameras 19 1 , geophysics [36 ] , astronomy [4] and multiplexed DNA microarrays 14 ] . 

In CS one wishes to efficiently reconstruct an unknown vector of values x = (xi, xn), 
assuming that x is sparse, i.e., have at most s non-zero entries, for some s <C N. It has 
been shown that x can be reconstructed using k <C N basic operations termed "measure- 
ments" , where a measurement is simply the output y of the dot-product of the (unknown 
vector) x with a known measurement vector m, y = mx. By using the values of these k 
measurements and their corresponding m's, it is then possible to reconstruct the original 
sparse vector x. 

Mapping of group testing into a CS setting is simple. The entries of x contain 
the genotype of each individual at a specific genetic locus and are non-zero only for 
minor allele carriers, thus since we are interested in rare alleles x is indeed sparse. A 
measurement in our setting corresponds to sequencing the DNA of a pool of several 
individuals taken together, hence the measurement vector represents the individuals 
participating in a given pool and the output of the measurement is proportional to the 
total number of rare alleles in the pool. Our basic unit of operation is a single 'run' 
or 'lane', which is used to sequence L pre-defined different loci in the genome, whether 
consecutive in one specific region, taken from different regions or surrounding different 
SNPs. We treat each of the L different loci separately and reconstruct L different vectors 
x, thus the amount of computation increases linearly with the number of loci of interest. 

Formulating the problem in terms of CS opens the door to utilizing this fascinating 
and seemingly almost magical theory for our purposes. In particular, from a theoretical 
perspective one can use CS bounds to estimate the number of samples and lanes needed 
for reconstruction, and the robustness of the reconstruction to noise. From a more prac- 
tical point of view, we can apply numerous algorithms and techniques available for CS 
problems, and benefit from the development of faster and more accurate reconstruction 
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algorithms as the state of the art is constantly improving (see e.g. [2j].) We believe that 
CS is a suitable approach for identifying rare allele carriers, and hope that this paper 
is merely a first step in this direction. 

In this work we present results of extensive simulations which aim to explore the 
benefits and limitations of applying CS for the problem of identifying carriers of rare 
alleles in different scenarios. We provide a detailed model of the experimental procedure 
typical to next generation sequencing and find scenarios in which the benefit of applying 
CS is overwhelmingly large (up to over ~70X improvement) compared to the naive one- 
individual-per-lane approach. We also show that our method can be used in addition 
to barcodes, and provide a significant improvement over applying each of these methods 
by itself. 

The paper is organized as follows: Section [2] presents CS in the context of identifying 
carriers of rare alleles, and the specific details of our proposed pooling design, genotype 
reconstruction algorithm and the noise model reflecting the pooled sequencing process. 
Simulation results are presented in Section [31 and provide evidence for the efficiency of 
our approach along a wide range of parameters. Finally, Section 0] offers conclusions and 
outlines possible directions for future research. 



2 Methods 

We first provide a short overview of CS, followed by a description of its application to 
our problem of identifying rare alleles, and the corresponding mathematical formulation 
including a noise model reflecting the sequencing process. Finally we show how one 
performs reconstruction while utilizing barcoding. 



2.1 The Compressed Sensing Problem 

In a standard CS problem one wishes to reconstruct a sparse vector x of length N, by 
taking k different measurements %)i = rrij ■ x, i = 1, . . . , k. This may be represented as 
solving the following set of linear equations 

Mx = y (1) 

where M is a k x N measurement matrix or sensing matrix, whose rows are the 
different m^'s (as a general rule, we use upper-case letters to denote matrices: M,E, .., 
lower boldface letters to denote vectors: x, y, .. and lower case to denote scalars: x, y, x- L .) 

Typically in CS problems, one wishes to reconstruct x from a small number of 
measurements, i.e. k <C N, hence the linear system (pQ) is under-determined, namely 
there are 'too few' equations or measurements and x cannot be recovered uniquely. 
However, it has been shown that if x is sparse, and M has certain properties, the 
original vector x can be recovered uniquely from Eq. (TT]) 15]. More specifically, a 
unique solution is found in case k > Cslog(N/ s), where C is a constant, and s is the 
is the number of non-zero entries in x. This somewhat surprising result stems from 
the fact that the desired solution x is sparse, thus contains less 'information' than a 
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general solution. Therefore, one can 'compress' the amount of measurements or "sensing" 
operations required for the reconstruction of x. 

A sensing matrix which allows for a correct reconstruction of x posses a property 
known as "Uniform Uncertainty Principle" (UUP) or Restricted Isometry Property 



(RIP) [3, Briefly, UUP states that any subset of the columns of M of size 2s 

form a matrix which is almost orthogonal (although since k < N the columns cannot 
be perfectly orthogonal), which, in practice, makes the matrix M "invertible" for sparse 
vectors x. The construction of a "good" sensing matrix is an easy task when one is able 
to use randomness. As an example of a UUP matrix consider a Bernoulli matrix, namely 
a matrix whose entries are independent random variables set to be 1 or —1 with prob- 
ability It is known that almost any instance of such a random matrix will satisfy 
UUP with an overwhelming probability pjl, Q] (the same is true when each entry in the 
matrix is a standard Gaussian random variable.) 

Once M and y are given, CS aims to find the sparsest possible x which obeys Eq. ([1]). 
This can be written as the following optimization problem: 

x* = argmin\ |x| |o s.t. Mx = y (2) 

X 

where the £q norm | |x| |q = l{ Xi ^o} simply counts the number of non-zero elements 



m x. 



Problem ([2]) involves a non-convex £q term and can be shown to be computationally 



intractable in general 10]. However, another impressive breakthrough of CS theory is 
that one can relax this constraint to the closest convex £ p norm, namely the i\ norm, and 
still get a solution which, under certain conditions, is identical to the solution of problem 
(J2J). Hence the problem is reformulated as the following £\ minimization problem, which 
can be efficiently solved by convex optimization techniques: 

x* = argmin\ |x| 1 1 s.t. Mx = y (3) 

X 

In most realistic CS problems measurements are corrupted by noise, hence Eq. (pTJ) is 
replaced by Mx + ry = y, where r\ = (rjx, . . . , %) are the unknown errors in each of the k 
measurements, and the total measurement noise, given by the £2 norm of r\ is assumed 
to be small. Therefore, the optimization problem is reformulated as follows: 

x* = argmin\ |x| 1 1 s.t. ||Mx — y||2<e (4) 

X 

where e > is set to be the maximal level of noise we are able to tolerate, while still 
obtaining a sparse solution. It is known that CS reconstruction is robust to noise, thus 
adding the noise term e does not cause a breakdown of the CS machinery, but merely 
leads to a possible increase in the number of measurements k [{J. 

Many efficient algorithms are available for problem dH) and enable a practical solution 
even for large matrices, with up to tens of thousands of rows. We have chosen to work 



with the commonly used GPSR algorithm 211 ] . 



L It is common to require orthonormality of the columns therefore the entries should also be divided by \fk. 
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2.2 Rare-Allele Identification in a CS Framework 



We wish to reconstruct the genotypes of N individuals at a specific locus. The genotypes 
are represented by a vector x of length N, where Xi represents the genotype of the i'th 
individual. We denote the reference allele by A, and the alternative allele by B. The 
possible entries of X{ are 0, 1 and 2, representing a homozygous reference allele (AA), a 
heterozygous allele (AB) and a homozygous alternative allele (BB), respectively. Hence, 
Xi counts the number of (alternative) B alleles of the i'th individual, and since we are 
interested in rare minor alleles, most entries Xi are zero. In classic CS the unknown 
variables are typically real numbers. The restriction on x in this case is expected to 
reduce the number of measurements needed for reconstruction and may also enable 
using faster reconstruction algorithms, as it is known that even a weaker restriction, 
namely that all entries are positive, already simplifies the reconstruction problem 171 ] . 

The sensing matrix M is built of k different measurements represented by the rows of 
M. The entry m y - is set to 1 if the fth individual participates in j'th measurement, and 
zero otherwise. Each measurement includes a random subset of individuals, where the 
probability to include a certain individual is 0.5. Hence, M is equivalent to the Bernoulli 
matrix mentioned in Section \2A\ which is known to be a "good" sensing matrbj§ (another 
type of a sensing matrix, in which only elements in each measurement are non-zero 

is considered in Section [3]). 

In practice, measurements are performed by taking equal amounts of DNA from 
the individuals chosen to participate in the specific pool, thus their contribution to 
the mixture is approximately equal. Then, the mixture is amplified using PCR, which 
ensures that the amplification bias generated by the PCR process affects all individuals 
equally 3CJ]. Finally, DNA of each pool is sequenced in a separate lane, and reads are 
mapped back to the reference genome (this may be performed using standard alignment 
algorithms such as MAQ 35J].) For each locus of interest we record the number of reads 
containing the rare allele together with the total number of reads covering this locus in 
each pool, denoted by r. These numbers provide the measurement vector y representing 
the k frequencies obtained for this locus in the k different pools. The measurement 
process introduces various types of noise which we model in the next section. 

For each locus, our goal is to reconstruct the vector x, given the sensing matrix M 
and the measurement vector y, while realizing that some measurement error e is present 
(see Eq. @.) Our experimental design is illustrated in Fig. [IJ and the following section 
describes its mathematical formulation. 



2.3 Mathematical Formulation of Our Model 



The model presented here, including the range of par ameters chosen, aims at reflecting 
the sequencing process by the Illumina technology 26], but may also be applied to other 
next generation technologies as well. It is similar, but not identical, to the model pre- 
sented in 4J]. For clarity of presentation, we first describe our model while ignoring the 



different experimental noise factors, and these are added once the model is established. 
2 This is easily seen by using the simple linear transformation x — > (2x ~ l)/y/k. 
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Figure 1: Schematics of the CS based pooling method. Consider the case of 9 people, out of which 
one is a heterozygotic carrier of the rare SNP (marked green), and another one who is a homozygous 
alternative allele carrier (marked red.) Each sample is randomly assigned to a pool with probability 0.5, 
as described by the sensing matrix. For example, individuals 1, 4, 6, 7 and 9 are assigned to the first pool. 
The DNA of the individuals participating in each pool is mixed, and the fraction of rare alleles in each 
pool is measured. For example, the first pool contains the two carriers, hence the frequency of the B's 
is 1 + 2 out of the 2x9 alleles. The sensing matrix and the resulting frequencies are incorporated into an 
underdetermined set of linear constraints, from which the original rare SNP carriers are reconstructed. 



Let x be the unknown sparse genotypes vector, as described in Section 12.21 The 
fraction of individuals with the rare allele is denoted /, thus the vector x has s = fN 
non-zero elementJE M is a k x N Bernoulli sensing matrix, and denote by M the 
normalized version of M whose entries represent the fraction of each individual's DNA 
in each pool: 

thij = ^]v^ (5) 

Assume that the mixing of DNA is perfect and unbiased, and that each DNA segment 
from each individual in a pool is equally likely to be read by the sequencing machinery. 
Suppose that a read from the i'th pool is drawn from a DNA segment covering our 
desired locus. It is then expected that this read will contain the B allele with probability 
qi = ^rhj • x, where rhj is the i'th row of M (the ^ pre-factor is due to the fact that both 
alleles are sequenced for each individual.) The vector of frequencies of the B allele, for 
each of the k pools is therefore: 

q = ^Mx (6) 

Had we been able to obtain a full and error-free coverage of the DNA present in the 
pool, our measurements would have provided us with the exact value of q. In practice a 
specific position is covered by a limited number of reads, which we denote by r, and the 



3 Here and throughout the paper we define the fraction of individuals (rather than alleles) as the 'rare- 
allele-frequency' - thus the fraction of alternative alleles, assuming Hardy- Weinberg (HW) equilibrium, is in 
fact 1 - VT~~J. 
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number of reads from the rare alleles z out of the total number of reads r is binomially 
distributed Z{ ~ Binomial(r,qi). Generally, one can only control the expected number 
of reads covering a specific locus, as also r is considered a random variable. The main 
cause for variation in r is the different amplification biases for different regions, which 
are effected by properties such as a region's GC-content. The distribution of r over 
different loci depends on the experimental conditions, and was shown to follow a Gamma 



distribution in certain cases 44 ] . We adopt this assumption, draw r for each locus from 
a Gamma distribution r ~ r(-^, 1)j an d apply it to all k pools. 

This binomial sampling process provides measurements which are close, yet not iden- 
tical to the expected frequency of the rare allele rqi, and these fluctuations are regarded 
as sampling noise. Therefore the CS problem formulation is given by (compare to 
Eq. @): 

x* = argmin ||x||i s.t. || — Mx z||2 < e (7) 

xe{0,i,2p 2 r 

Adding noise factors. The model described so far assumes that no noise or bias exist 
in our setting, besides sampling noise which is related to the limited number of reads. 
In a more realistic scenario, we do expect additional noise factors to be present due to 
imperfection in experimental procedures. We have modeled these factors by adding two 
more types of noise: sequencing read errors and errors in DNA preparation. 

Read error models the noise factors introduced throughout the process of sequencing 
by next generation techniques such as Illumina, and reflect the fact that reads obtained 
from the sequencing machine may not match the DNA molecule sampled. This can 
be due to errors in certain bases present in the read itself, mis-alignment of a read 
to a wrong place in the genome, errors introduced by the PCR amplification process 
(which are known to introduce base substitutions in the replicated DNA [3]), or any 
other unknown factors. All of these can be modeled using a single parameter e r , which 
represents the probability that the base read is different from the base of the measured 
sample's DNA at a given locus. The resulting base can be any of the other three different 
nucleotides, however we conservatively assume that the errors will always produce the 
alternative allele B (if, for example the reference allele is 'G' and the alternative allele 
is 'T', we assume that all erroneous reads produce 'T'. In practice, some reads will 
produce 'A' and £ C', and these can be immediately discarded thus reducing the effective 
error rate.) The probability of observing B at a certain read is therefore obtained by a 
convolution of the frequency of B alleles and the read error: 

q=(l-e r )^x + e r (l-^Mx) (8) 

The value of e r may vary as a function of the sequencing technology, library prepa- 
ration procedures, quality controls and alignment algorithms used. Typical values 



of e r , which represent realistic values for Illumina sequencing 3l|], are in the range 
e r ~0.5%— 1%. We assume that e r is known to the researcher, and that it is similar 
across different lanes. In this case, one can correct for the convolution in Eq. ([H]) and 
obtain the following problem: 

x* = ar grain ||x||i s.t. \\-Mx — (-z — e r )/(l — 2e r )||2 < e (9) 
xe{o,i,2} N 2 r 
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Hence the measurement vector in our problem equals y = (-z — e r )/(l — 2e r ) and the 
sensing matrix is \M. If e r is unknown, one may still estimate it, for example by running 
one lane with a single region on a single individual with known genotypes. Alternatively, 
we show in Appendix [X] how to incorporate the estimation of the read error term e r 
within our CS framework from the overlapping pooled sequence data. The noise factors 
described thus far (including sampling noise and read errors) resembles the one proposed 



previously in 441 ]. 




Finally we add to our model one more source of noise, namely DNA preparation 
errors (DP errors.) This error term reflects the fact that in an experimental setting it is 
hard to obtain exactly equal amounts of DNA from each individual. The differences in 
the actual amounts taken result in noise in the measurement matrix M. While M is our 
original zero-one Bernoulli matrix, the actual measurement matrix M' is obtained by 
adding DP errors to each non-zero entry. Hence, the true mixture matrix is M' = M+D, 
where the DP matrix D adds a centered Gaussian random variable to each non-zero entry 
of M: 

if my = 1 (1Q) 
otherwise. 

We consider values of a in the range of — 0.05 reflecting up to ~ 5% average noise on 
the DNA quantities of each sample. The matrix M' is unknown and we only have access 
to M, hence the form of Eq. ([9]) in this case is unchanged. M' takes effect indirectly by 
modifying q, which effects z the actual number of reads from the rare allele. As opposed 
to a classic CS problem in which the sensing matrix is usually assumed to be known 
exactly, DP effectively introduces noise into the matrix itself. We study this effect of 
DP errors in Section [3l and show that a standard CS approach is robust to such noise. 

Targeted region length and coverage considerations. The expected number of reads 
from a certain locus is determined by the total number of reads in a lane R and the 
number of loci covered in a single lane L, and is given by E[r] = (the actual number 
of reads from each locus r follows a Gamma distribution with mean ^.) 

L is determined by the size of the regions and the number of SNPs of interest in a 
given study, and by the ability of targeted selection techniques [l|, 0, 42] to enrich for 



a given small set of regions. We consider L as a parameter and study its effect on the 
results. When we treat different isolated SNPs, L indeed represents the number of SNPs 
we cover, as each read covers one SNP. When interested in contiguous genomic regions, 
however, L should be interpreted as the length of the target region in reads, rather than 
nucleotides, since each read covers many consecutive nucleotides. Therefore one should 
multiply L by the read length. For example, if our reads are of length 50 nucleotides, 
and L is taken to be 100, we in fact cover a genomic region of length 5kb. 

R is defined as the number of reads which were successfully aligned to our regions 
of interest. It is mostly determined by the sequencing technology, and is in the order of 
millions for modern sequencing machines. R is also greatly influenced by the targeted 
selection techniques used, and since these are not perfect, a certain fraction of reads 
might not originate from the desired regions and is thus 'wasted'. The total number of 
reads varies according to experimental protocols, read length and alignment algorithms 
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but is typically on the order of a few millions. Throughout this paper we have fixed R 
to be R = 4000000, representing a rather conservative estimate of a modern Illumina 



genome analyzer's run (e.g. compare to [3a]). and also assuming that targeted selection 



efficiency is very high (it is reported to be ~ 90% in [25].) Other values of R may be easily 
dealt with using our simulation framework, thus adapting to a particular researcher's 
needs. 

Another important and related parameter is the average coverage per individual per 
SNP, denoted c, which is given by: 

R/L f _E[rU 

Our model does not directly use c and it is provided merely as a rough estimate for 
the coverage in Section [3j as it can be easily interpreted and compared to coverage values 
quoted for single sample sequencing experiments. When the total number of reads in a 
pool r is given, the actual coverage obtained for each person in a pool has a distribution 
which is approximately Binomial(r,l/N poo i ec [), where N poo i ei i ~ y is the number of 
individuals in the pool. Therefore the average coverage per individual in a given pool is 
indeed approximately c. 



2.4 Example 

In order to visualize the effect of the three noise factors, i.e., sampling noise, read errors 
and DP errors, Fig. [2] presents the measured values y in a specific scenario. We simulate 
an instance of N = 3000 individuals and rare allele frequency / = 0.1%, tested over 
k = 100 lanes. Hence we have three heterozygotic carriers to be identified, and, in 
the absence of noise, the measurement in each lane should display four levels, which 
correspond to whether 0, 1, 2 or 3 of the carriers are actually present in the specific 
pool. 

In order to display the effect of sampling noise, we consider three values for the 
average coverage c , i.e., number of reads per individual per SNP: 27, 267 and an infinite 
number of reads, which corresponds to zero sampling noise. Each of these three values 
appears on a separate row in Fig. El The panels on the left hand side of Fig. [2] correspond 
to read error e r = 1%, while on the right hand side there is no read error at all. The 
data in all panels contain DP errors with a = 0.05. Each panel also displays the actual 
number of reconstruction errors in each case, namely the Hamming distance between 
the correct vector x and reconstructed vector x* obtained by solving Eq. Q. 

The effect of sampling noise is clearly visible in Fig. [2j An infinite amount of reads 
(lower row; (e),(f)), causes the measurements to be very close to their expected frequency, 
where slight deviations are only due to DP errors and the fact that the pool size is not 
exactly N/2. For a moderate number of reads (c = 267 - middle row; (c),(d)) the 
measurements follow the expected frequency levels when there are no read errors (right 
panel (d)), but this rough quantization completely vanishes for e r = 1% (left panel (c).) 
However, reconstruction was accurate even in this case, because our CS formulation 
(Eq. ©) takes these errors into account and aggregate the information from all lanes to 
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enable reconstruction. When the number of reads per person is small (c = 27 - upper 
row; (a),(b)), the four levels disappear irrespective of the read error. Reconstruction is 
still accurate in the absence of read errors (right panel (b)), and there are 4 errors in the 
reconstructed genotype vector x* when e r = 1% (left panel (a)), which probably implies 
that sampling noise is too high in this case. While a coverage of 27 reads per person is 
overwhelmingly sufficient when sequencing a single individual, it leads to errors in the 
reconstruction when pooling many individuals together. 
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Figure 2: The values measured for 100 different pools, for a specific case of N = 3000 individuals 
and rare allele frequency / = 0.1%, thus 3 individuals carry the alternative allele. Shown are the 
measured rare-allele frequencies in each lane, for different coverage levels, (27, 267 and oo reads per 
pool), and different values of the read error (e r = 0%, 1%). The data in all panels contain DP errors 
with a = 0.05. The dashed lines represents the possible expected frequencies corresponding to 0, 1, 2, 3 
rare-allele carriers in a pool - these are the values we would have obtained in the absence of read error, 
DP errors, and assuming that each pool contains exactly N/2 individuals. The coverage r (i.e. number 
of reads) is the most dominant factor causing deviations of the observed values from their expectancy. 
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2.5 Reconstruction 



We use the Gradient Projection for Sparse Reconstruction algorithm (GPSR) |2l(, to 
solve the optimization problem ([9|). GPSR is designed to solve a slightly different, but 
equivalent, formulation as in Eq. ([9]): 

x* = argmin ||— Mx — (— z — e r )/(l — 2e r )||2 + t||x||i (12) 
x 2 r 

where the parameter r provides the trade-off between the equations fit and the spar- 
sity promoting factor, and is equivalent to specifying the maximal allowed error e. It is 



often desirable in applications to let the parameter r scale with ||M T y|| 00 32|, where 
in our case y = (~z — e r )/(l — 2e r ) corresponds to the measurements. We have chosen 
to adopt this scaling throughout this paper, and have set r = 0.01 1 |M T y| |oo, although 
experimentation with different values of r did not alter results significantly. 

GPSR outputs a sparse vector x* with a few non-zero real entries, but does not use 
the fact that our variables are integers from the set {0, 1, 2}. We therefore perform a post- 
processing step in order to obtain such a solution. Simple rounding of the continuous 
results in x* may obtain such a vectoi@. We chose a different post-processing scheme, 
which yields better performance: we rank all non-zero values obtained by GPSR, and 
round the largest s non-zero values, setting to zero all other N — s values to get the 
vector x* s . We then compute an error term err s = ||^Mx* s — y||2- Repeating this for 
different values of s we select the vector x* s which minimizes the error term err s . Thus 
the final solution's sparsity s is always smaller or equal to the sparsity of the vector 
obtained by GPSR. 



2.6 Utilizing Barcodes 

In this section we describe how a CS -based method can be combined with a barcoding 
strategy resulting in improved performance. A barcode is obtained by attaching to 
the DNA in each sample a unique DNA sequence of about 5 additional nucleotides, 
which enables the unique identification of this sample 27]. Hence samples with different 
barcodes can be mixed together into a single lane, and reads obtained from them can 
be uniquely attributed to the different samples. In a pooled-barcodes design [2(3], the 
DNA in each pool (as opposed to the DNA of a specific individual) is tagged using a 
unique barcode. If fibar 

different barcodes are available, we may apply nt, ar pools to a 
single lane and still identify the pool from which each read originated (although not the 
specific individual.) 

We utilize barcodes by increasing the number of effective lanes while the number of 
reads per lane is decreased. The usage of k lanes and n^ar 

barcodes is simply translated 

into solving problem ([9]) with k x rib ar pools and R/rib ar total reads per lane. Barcodes 
can therefore be combined easily with our CS framework so as to improve efficiency. We 
did not try to estimate the relative cost of barcodes and lanes as it may vary according to 



4 by 'rounding' here we mean reducing to the set {0, 1,2}. Thus any negative number is 'rounded' to zero 
and any number larger than 2 is 'rounded' to 2. 
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lab, timing and technology conditions. We therefore solve the CS problem for different 
(k, nbar) combinations, thus presenting the possible trade-offs. 

2.7 Simulations 

We have run extensive simulations in order to evaluate the performance of our approach. 
Various parameter ranges were simulated, where each set of parameters was tested in 500 
instances. In each simulation we have generated an input genotype vector x, applied 
measurements according to our mathematical model, and have tried to reconstruct x 
from these measurements. In order to evaluate the performance of our approach, one 
needs a measure of reconstruction accuracy, reflecting the agreement between the input 
vector x and the reconstructed CS vector x* (even when executing the naive and costly 
approach of sequencing each individual in a separate lane, one still expects possible 
disagreements between the original and reconstructed vectors due to insufficient coverage 
and technological errors.) Each entry i for which Xi is different from x* is termed a 
reconstruction error, and implies that the genotype for a certain individual was not 
reconstructed correctly, yielding either a false positive (xi = / x*) or a false negative 
(xi ^ = x*). For simplicity, we have chosen to show a simple and quite restrictive 
measure of error: we distinguish between two 'types' of reconstructions - completely 
accurate reconstructions which have zero errors, and reconstructions for which at least 
one error occurred. 

A certain value of the problem's parameters (such as number of individuals, number 
of lanes, read error etc.) is termed "successful" if at least 95% of its instances (i.e. 
475 out of 500) had zero reconstruction errors, namely all individual genotypes were 
reconstructed correctly. Thus, even when testing for a few thousand individuals, we 
require that none of the reference allele carriers will be declared as a rare allele carrier. 
In particular, this requirement guarantees that the False-Discovery- Rate of discovering 
rare-allele carriers will not exceed 0.05. 

Performance is then measured in terms of N max , defined as the maximal number 
of individuals which allow for a "successful" reconstruction, for certain values of the 
problem's parameters. 

3 Results 

To explore the advantages of applying CS for efficiently identifying carriers of rare alleles, 
we performed various computer simulations of the experimental procedure described in 
Section [2j 

In each instance of the simulations we set the following parameters: The number of 
individuals grouped together N varied between 100 and 20000. Rare allele frequency / 
was chosen to be 0.1%, 1% and 2%, thus in each instance we randomly select s = Nf 
carriers, and this determines our input vector x. Since rare allele frequency is low, we 
mostly consider the case of a heterozygous allele (AB), hence x is a binary vector, with 
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l's marking the carrierq£|. The case of a homozygous alternative allele (BB) is presented 
in Section 13.1.21 thus in this case x can also contain the value 2. The number of lanes 
k varied between 10 and 500, and L, the number of targeted loci on the same lane was 
1, 10, 100 and 500 (corresponding to targeted regions of length 100 base-pairs to 50A;6, 
assuming each read is of length 100), which leads to different levels of coverage and 
sampling noise. The other noise factors are kept fixed, with read error e r = 1% and DP 
errors a = 0.05, unless specified otherwise. 

In Section 13.11 we estimate the performance of CS given all relevant noise factors, 
while in Section [3.21 we evaluate the individual effect of each of the three noise factors, 
i.e., sampling noise, read errors and DP errors. Section [3.31 shortly presents the effect 
of using a different sensing matrix in which only ~ y/N individuals participate in each 
pool, instead of ~iV/2. Finally, in Section [3.41 the effect of combining barcodes and CS 
is presented. 

3.1 Performance of the 'standard' experimental setup 

Figures 13161 present N max as a function of k, for different numbers of SNPs sequenced 
together on the same lane. The case / = 0.1% displays a different behavior than / = 1% 
and / = 2% and is considered separately. 

3.1.1 f = 0.1% 

The advantages of CS appear most dramatically in the case of rare alleles, e.g., for 
/ = 0.1% in Fig. O Each panel in Fig. [3] presents N max as a function of k, for dif- 
ferent numbers of SNPs L. The number of rare-allele carriers tested in this case were 
1,2,..., 20, leading to N = 1000, 2000, . . . , 20000. The vertical right axis displays the 
corresponding average coverage c, obtained via Eq. (jll|) . The thick black line in each 
figure is simply the line y = x, demonstrating the performance of the naive approach of 
using a single lane per sample. 

When the number of available lanes is large, we can successfully identify the carriers 
in groups of up to 9000 or 20000 individuals, for k = 500 lanes, and L = 10 or L = 1, 
respectively (Fig. EKa,b).) In case the number of available lanes is small we can still 
identify a single carrier out of 1000 individuals with merely k = 20 lanes, for L = 1 and 
L = 10. (inset in Fig. [3]^a,b).) With k = 30,40 lanes, we can identify 2 or 3 carriers in 
a group of 2000 or 3000, for L = 1, 10, respectively. 

As is evident from the four panels of Fig. El N max decreases as a function of L. 
For example, 500 lanes are sufficient to deal with 20000 individuals for L = 1, but 
only with 1000 individuals for L = 500. This results from insufficient coverage which 
causes an increase in sampling noise. Increasing the number of lanes can overcome this 
under-sampling as the value of N max increases almost linearly with k in most cases. 

In order to quantify the advantage of applying CS we define an "efficiency score", 
presented in Fig. 01 which is simply N max /k, i.e., the number of individuals for which 

5 Assuming a given locus follows HW equilibrium, the expected frequency of homozygous rare allele carriers 
is (1 — \/\ — f) 2 ~ / 2 /4 which is extremely low. 
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Figure 3: The maximal number of individuals N max which allow for a "successful" reconstruction as a 
function of the number of lanes used, for different numbers of loci treated simultaneously. A "successful" 
reconstruction means that for a certain set of parameters at least 475 out of 500 simulations yield zero 
reconstruction errors. The black line is simply the line y = x, demonstrating the performance of the 
naive approach of using a single lane per sample. The vertical right axis displays the corresponding 
average coverage c for every value of N max , obtained via Eq. (fTT|) . The bottom- right panels in (a) 
and (b) are zooming in on the region where the number of lanes is small, which is at present the most 
realistic scenario (in panel (c) N max was constant for low numbers of lanes.) The values of N max in this 
case were taken in units of 1000 individuals, which correspond to single carriers. Cases which appear 
to be missing, e.g. k < 200 for L = 500 simply mean that N max < 1000. 



reconstruction can be performed using the CS approach for a given number of lanes, 
divided by the number of individuals which can be treated using the naive one-individual- 
per-lane approach. Therefore, the higher the score, the more beneficial it is to apply CS 
. The black line in each plot has a value of 1 which corresponds to the naive scenario of 
one individual per lane. When considering up to L = 100 SNPs, the efficiency score is 
around or above 10, and in some cases is as high as 70. 

The axis on the right hand side of Fig. [3] displays the average number of reads per 
person, i.e., average coverage c, for the relevant N max . One important question is related 
to the optimal number of reads which allows for successful reconstruction: the smaller the 
coverage the more SNPs we can test on the same lane, yet we are more prone to (mostly 
sampling) noise. However, one can overcome the effects of low coverage by increasing 
the number of lanes, hence it is interesting to test the performance for each combination 
of coverage c and number of lanes k. In Fig. [5] we present this performance for N = 2000 
individuals and / = 0.1%. For each pair of coverage and number of lanes k we color 
code the percentage of instances for which there were errors in CS reconstruction. An 
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Figure 4: Efficiency score for our approach. The ratio N max /k in our approach and the number treated 
in the naive approach (equal to the number of lanes.) This represents the ratio of saved resources (lanes.) 
Efficiency is highest when a few lanes are used, and decreases gracefully as we use more and more lanes. 
Efficiency is highest when the targeted number of loci is small, as in this case each lane provides very 
high coverage. 



improvement in performance may be achieved both by increasing the coverage and by 
increasing the number of lanes. The white line marks the 95% accuracy threshold. The 
transition between "successful" and "unsuccessful" reconstruction is rather sharp. For 
low coverage, e.g., lower than 100 reads per person, a very high number of lanes is needed 
in order to overcome sampling noise. 
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Figure 5: Performance as a function of the number of lanes k and the average coverage c. Shown is 
the percentage of runs in which (even a single) reconstruction error occurred. A rather sharp transition 
is shown, where above the white line we achieve a completely accurate reconstruction in at least 95% 
of the simulations. Notice the non-linearity of scale in both axes. 
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3.1.2 / = 1%, / = 2% 

Figure [6] presents the results for / = 1% and / = 2%. In this case the values of 
N tested were 100, 200, . . . , 4000 (no successful reconstruction according to our criteria 
was achieved for N > 4000.) The resulting N max is lower than for the case of / = 
0.1%, although still much higher than in the naive approach. Results for L = 1 are 
similar to those of L = 10, namely increasing the coverage does not improve performance 
significantly in this case. The differences between results for / = 1% and / = 2% are 
rather small. The "efficiency score" in this case is lower (see Fig. [7J), and is around 5, 
still offering a considerable saving compared to the naive approach. 

All former simulations considered the case of identifying carriers of a heterozygous 
allele (AB). In order to study the possibility of also identifying homozygous alternative 
alleles (BB) via CS we simulated the following case: 1% of the individuals are BB 
in addition to 1% which are AB (this gives a vastly higher frequency of BB than is 
expected to be encountered in practice, and was taken as an extreme case to test the 
robustness of our reconstruction results.) The results are marked as "1% + 1%" in 
Figs. I6|7l Our CS framework deals with this scenario in exactly the same way as the 
other AB cases, although the results are, as expected, slightly worse than that of 2% 
heterozygous carriers. 



#SNPs: 1 #SNPs: 10 




#lanes #lanes 

Figure 6: The maximal number of individuals N max as a function of the number of lanes. Similar to 
Fig. [3] but with a higher frequency for the rare allele, / = 1% and / = 2%. The number of individuals 
Nmax achieved decreases as we increase the rare allele frequency, but we are still able to treat a much 
larger sample size than the naive approach. For example, one can use 40 lanes for L — 100 and recover 4 
rare-allele carriers out of 400 (/ = 1%, zoomed-in view in panel(c).) The case / = 1% + 1% corresponds 
to 1% AB and 1% of BB alleles. 
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Figure 7: Efficiency score of our approach. Similar to Fig. |4] but with rare allele frequency of 1% and 
2%. As expected, efficiency is decreased as rare allele frequency increases. We still reach up to 13X 
and 7X improvement over the naive approach for rare allele frequencies of 1% and 2% respectively. 



3.2 The effect of noise 

Figures 18191 present the effects of three types of noise in the specific case of / = 1%. 
In both figures the reference is the "standard" performance of / = 1% which appeared 
before in Fig. El and includes sampling noise, read error (e r = 1%) and DP errors 
(<r = 0.05). We consider the case of sampling error separately from the other two 
sources, as its impact is different. 

3.2.1 Sampling error 

Figure [8] compares the "standard" performance to the case where an infinite number 
of reads are available (although read error and DP errors are still present.) Differences 
between the cases appear only when the number of SNPs is high, thus the number of 
reads per person c is insufficient. In these cases N max is reduced by a factor of 2 to 4 
with respect to an infinite coverage. When the number of SNPs is small, and coverage 
is high, we see no difference between the "standard" performance and the infinite read 
case. 

3.2.2 Read errors and DP errors 

Figure [9] compares the "standard" performance to two cases: one in which e r = and 
another in which a = 0. In the absence of read errors Nmax may be twice as large 
as when e r = 1%. Read errors take a significant effect on performance only when L 
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Figure 8: Effect of sampling error. The dashed line represent results obtained in the limit where 
the number of reads goes to infinity thus sampling error is zero. The solid line represents the realistic 
scenario with the current number of reads used. Sampling error is seen to be a significant factor when 
we treat many loci together in the same lane (L = 100 or more), while for a few loci (L = 10 or less) 
we already have enough coverage to make sampling error negligible. 



is large (100 or 500), since when coverage is high read errors are compensated for (see 
Eq. ([9]).) In all cases we have studied the results are quite robust to DP errors, thus 
noise introduced by realistic pooling protocols should be easily overcome by the CS 
reconstruction. 




Figure 9: Effect of read errors and DP errors. The two dashed lines represent results obtained when 
assuming that reads are perfect and DP errors is zero, respectively. The solid line represents a realistic 
scenario with a read error of 1% and DP errors of a = 0.05. While read error appears to have a 
significant factor in reducing N max , the effect of DP errors seems to be negligible. 
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3.3 Modifying the sensing matrix 



In all simulations presented so far we have considered the case where each pool includes 
approximately N/2 individuals. It may be desirable to minimize the number of indi- 
viduals per pool [20|], as this can lead to a faster and cheaper preparation of each pool. 
Here we shortly present the possibility of modifying M into a sparse sensing matrix, 
thus accommodating the requirement of having few individuals per pool. 

Figure [10] presents the results of using only \/N individuals in each pool, for the 
case / = 1% (marked as u \^N,f = 1"). For a small number of loci taken together 
{L = 1 or 10) the former dense Bernoulli(0.5) sensing matrix achieves higher N max 
values. However, when the number of loci is large {L = 100 or 500) and for large number 
of lanes, it is preferable to use sparse pools of size y/~N. The same qualitative behavior 
was observed for / = 0.1% and / = 2%. The success of sparse matrices in recovering the 
true genotypes is not surprising given theoretical and experimental evidence [3]. Further 
research is needed in order to determine the optimal sparsity of the sensing matrix for 
a given set of parameters. 




Figure 10: The effect of applying pools of %/iV or N/2 individuals (on average) for the case of f — 1%. 
Overall, results are comparable, yet each pooling design is preferable for different settings of lanes and 
loci. The sparse (v^V) design is more beneficial for large number of lanes and longer target regions. 
The average coverage on the right axis of each panel corresponds to the N/2 case. The coverage in 
the VN case is much larger since the total number of reads R is divided among a smaller number of 
individuals. 



21 



3.4 Combining barcodes and CS 

Barcodes may also be combined with CS so as to improve efficiency and further reduce 
the number of required lanes. The DNA in each pool (as opposed to the DNA of a 
specific individual) may be tagged using a unique barcode (see Section 12.61 ) Hence, in 
case we have ni, ar different barcodes available, we apply nb ar pools to a single lane, with 
the price being that each pool contains only R/ntar reads. Figure [TT1 displays N max as a 
function of the number of lanes, for different values of ribar, and for different rare allele 
frequencies. 

The black line in each figure represents the "naive" capacity, which is simply k x n\, ar . 
Incorporating even a small number of barcodes into our CS framework results in a dra- 
matic increase in N max . For the same problem parameters but without using barcodes, 
could not recover the minimal possible number of individuals N max = 1000 for / = 0.1% 
and could not reach more than N max = 100 for / = 1%,2% (see Figs. I3|6l ) Similarly 
to the non-barcodes case, the advantage over the naive approach is most prominent for 
/ = 0.1%, but is still significant for / = 1%, 2%. As the number of barcodes increases, 
the difference in performance between different sparsities / becomes smaller. As long 
as the coverage is kept high, it is still beneficial to increase the number of barcodes, as 
it effectively increases linearly the number of lanes. At a certain point, when many dif- 
ferent barcodes are present in a single lane, coverage drops and sampling error becomes 
significant, hence the advantage of adding more barcodes starts to diminish. 
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Figure 11: Results obtained by combining the CS approach with barcoding for L = 1. Barcoding 
improves results by enabling a higher number of 'effective' lanes (although these lanes contain a smaller 
number of reads.) The effect of adding a large number of barcodes is more pronounced for high minor- 
allele frequency. For example, at / = 2% and with 7 lanes, we can treat roughly 300 individuals with 
10 barcodes, but around 2500 individuals with 200 barcodes. The increase in power is sub-linear, as is 
seen by the fact that when we add more barcodes, the performance becomes closer to that of the naive 
approach (shown in black) which increases linearly with the number of lanes. Still, only at a very high 
number of barcodes the naive approach can perform as well as the CS design. 
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4 Discussion 



We have presented a method for identifying rare allele carriers via CS -based group 
testing. The method naturally deals with all possible scenarios of multiple carriers and 
heterozygous or homozygous rare alleles. Our results display the advantages of the 
approach over the naive one-individual-per-lane scenario: it is particularly useful for the 
case of a large number of individuals and low frequencies of rare alleles. We have also 
shown that our method can benefit from the addition of barcodes for different pools 20f ] . 
and still improve upon 'standard' barcoding. 

We view our main contribution as outlining a generic approach that puts together 
sequencing and CS for solving the problem of carrier identification. Following this map- 
ping we may apply the vast amount of CS literature and benefit from any advancement 
in this rapidly growing field. We believe this is a major advantage over more 'tailored' 
approaches, e.g. [2CJ, [4J], although these methods may be superior to ours in specific 
cases. 

Our method is simple in the sense that it applies CS in the most straightforward 
fashion. We have used an off-the-shelf CS solver and did not try to optimize any CS 
related parameter for the different scenarios - all parameters were kept fixed for all types 
of simulations thus optimizing reconstruction algorithms is likely to improve our results. 
Moreover, the method's performance as detailed in Section [3] may be further improved, 
since our formulation of the CS problem has incorporated only part of the available 
information at hand. Using additional information may reduce the number of lanes or 
total reads needed to achieve a certain accuracy, as well as enable faster algorithms for 
reconstruction, thus allowing us to deal with larger sample sizes and larger regions. 

As an example, the information that the input signal is trinary (0, 1, 2) was considered 
only in the post-processing step after running GPSR, although it may be incorporated 
into the optimization procedure itself. Since most alleles are approximately distributed 
according to HW equilibrium, the frequency of 2's, derived from the frequency of l's is 
very low, and the input vector is most often Boolean. Therefore, we could also mod- 
ify the optimization so as to 'punish' for deviations from this pattern. In addition, we 
have treated each rare allele independently, although in case alleles originate from the 
same genomic region, one could use linkage disequilibrium information and reconstruct 
haplotypes. For example, if two individuals share a rare allele at position i, they are 
also likely to share an allele, probably rare, at position i + 1. It would be a challenge to 
add these constraints and enable the reconstruction of several adjacent loci simultane- 
ously. Another improvement may stem from more accurate modeling of specific errors of 
sequencing machines, including quality scores which provide an estimate of error prob- 
ability of each sequenced base [a. |31|. Such careful modeling typically results in far 
smaller error rates than the ones we have conservatively used (~ 0.5% — 1%.) 

Another possible improvement may be to apply adaptive group testing, namely to 
decide whether to use lane k, based on reconstruction results of lanes 1, . . . , k — 1. This 
enables an adaptation to unknown sparsity by simply generating pools one by one, each 
time solving the CS problem and checking if we get a sparse and robust solution. Once 
the solution stabilizes, we can stop our experiments thus not 'wasting' unnecessary lanes 
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(it may also be beneficial to change the pooling design of the next measurement and allow 
deviations from the randomized construction we have shown, once statistical information 
regarding the likely carriers is starting to accumulate). 

The drawbacks of our method stem from the limitations of CS and sequencing tech- 
nology. First of all, in case rare allele frequency is high enough, the sparsity assumption 
at the heart of the CS theory breaks down, and it may be problematic for CS to re- 
construct the signal. The highest frequency possible for CS to perform well in this 
application was not determined, but one should expect a certain frequency above which 
it is no longer beneficial to apply CS and the naive one-individual-per-lane approach is 
preferable. The simulations we have performed estimate this frequency to be over 5% in 
most cases, thus taking effect only for the case of common alleles. 

The number of individuals pooled together and the length of the targeted regions are 
limited by the total capacity of a given lane. We have shown that efficient reconstruction 
is possible when a significant number of individuals is pooled together, and when one 
can target efficiently a rather small genomic region. However, efficiency could be further 
improved if we could maintain a large enough coverage while further increasing the pool 
and region size. We expect that in the foreseeable future sequencing technology would 
yield higher number of reads per lane, which would allow larger pool size and genomic 
regions to be treated by our approach. 

Another difficulty in our approach is related to the issue of randomness of the sensing 
matrix M. This randomness may be discarded by simply fixing a certain instance of the 
sensing matrix, although randomness in this case may be viewed as an advantage of 
CS - almost any (random) matrix would enable reconstruction, as opposed to intricate 
pooling schemes which need to be carefully designed. 

The last drawback we should mention is related to the fact that each pool contains 
approximately half the individuals in the group. This may be problematic in cases where 
pooling preparation might be slow and costly, and we need to minimize the number of 
individuals in each pool [2(3]. In this case it may be interesting to apply a sparser pooling 
design. As shown in Section T3.3l there are scenarios in which it is advantageous to assign 
only \/N individuals to a pool. Therefore, one needs to optimize the pool design together 
with other parameters, e.g. number of loci and lanes considered. This issue remains for 
future study. 

Another major direction we intend to pursue is to test the CS approach experi- 
mentally. To the best of our knowledge, no available data is completely suitable for 
our purposes, thus we can not 'adapt' current data sets and 'simulate' such an experi- 
ment. Our approach is most beneficial for relatively high coverage and a large number 
of individuals, thus designing a suitable experiment is needed. 

Finally, while we have demonstrated the benefits of CS -based group testing approach 
for genotyping, any genetic or epigenetic variant is amenable to our approach, as long as 
it can be detected by next generation sequencing technology and is rare in the population 
of interest. For example, Copy Number Variations (CNVs), important for studying both 
normal population variations and alterations occurring in cancerous tissues, provide a 
natural extension to our framework. In this case the number of reads serves as a proxy 
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to the copy number at a given locus, and the vector to be reconstructed contains the 
(integer) copy number levels of each individual, rather than their genotypes. Another 
example is given by rare translocations, often present in various tumor types - where an 
evidence for a translocation may be provided by a read whose head is mapped to one 
genomic region and whose tail is mapped to another distal region (or by two paired-end 
reads, each originating from a different genomic region.) Carriers of a particular rare 
translocation may be discovered using this approach. The extension of our method to 
these and perhaps other novel applications provides an exciting research direction we 
plan to pursue in the future. 
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A Coping with unknown read error 

We assume that the read error e r is unknown to the researcher, but is constant across 
all lanes. One can introduce a slight modification to our procedure, which enables the 
learning of e r from our pooling data. We replace z in Eq. (|7|) by the convolution: 

z * e r = z + e r — 2ze r (13) 
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The additive factor e r — 2ze r is different for different values of z, but its dominant 
part is e r . We can approximate it by xn+i = e r — 2ze r , obtained from averaging the 
term 2ze r over all z values (i.e. z is the mean value of the vector z). We can now 
reformulate the CS problem by adding one extra variable. Specifically, the unknown 
vector x is replaced by x' = (x, x/v+l) anci Eq. ([7]) is replaced by: 

x'* = argmin\ |x'| |i s.t. || — M'x' z'||2 < e (14) 

x' 2 r 

where M' is built from M by adding a constant column to its right as its N + l's 
column with all its values set to —1. 
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